Abstract

The Single Cut or Join (SCJ) operation on genomes, generalizing chromo- 
some evolution by fusions and fissions, is the computationally simplest known 
model of genome rearrangement. While most genome rearrangement prob- 
lems are already hard when comparing three genomes, it is possible to com- 
pute in polynomial time a most parsimonious SCJ scenario for an arbitrary 
number of genomes related by a binary phylogenetic tree.

Here we consider the problems of sampling and counting the most par- 
simonious SCJ scenarios. We show that both the sampling and counting 
problems are easy for two genomes, and we relate SCJ scenarios to alternat- 
ing permutations. However, for an arbitrary number of genomes related by a 
binary phylogenetic tree, the counting and sampling problems become hard. 
We prove that if a Fully Polynomial Randomized Approximation Scheme or 
a Fully Polynomial Almost Uniform Sampler exists for the most parsimonious
SCJ scenario, then RP = NP. 

The proof has a wider scope than genome rearrangements: the same 
result holds for parsimonious evolutionary scenarios on any set of discrete 
characters. 

Keywords: MSC codes: F.2.2: Computations on discrete structures, G.2.1: 
Counting problems, free keywords: Single cut and join, FPAUS, FPRAS, 
non-approximability 
1. Introduction

The genome rearrangement problem is one of the oldest optimization
problems in computational biology. It was already formulated by Sturtevant
and Novitski (1941). It consists in finding the minimum number of
rearrangement events that can explain the gene order differences between
two genomes. Depending on how genomes and rearrangements are defined, a
number of variants have been studied (Fertin et al., 2009). In many cases,
efficient algorithms running in polynomial time exist for finding one solution,
but they do not scale up to three genomes: finding a median, i.e., a genome
minimizing the sum of the number of rearrangements to the three others, is
almost always NP-hard.

Moreover, one solution is not representative of the whole optimal solution 
space. So another computational problem is to find all minimum solutions. 
But the number of minimum solutions is often so high that their explicit 
enumeration is not possible in polynomial running time. A small number of 
samples coming from (almost) the uniform distribution is usually sufficient for
testing evolutionary hypotheses like the Random Breakpoint Model (Alekseyev
and Pevzner, 2010; Bergeron et al., 2008) or the sizes and positions
of inversions (Ajana et al., 2002; Darling et al., 2008). Drawing conclusions
from one scenario or from a biased sample should be avoided as it might be
very misleading (Bergeron et al., 2008; Miklos and Darling, 2009).



Statistical methods, like Markov chain Monte Carlo methods, can sample
genome rearrangement scenarios (Darling et al., 2008; Durrett et al., 2004;
Larget et al., 2002, 2005; Miklos and Tannier, 2010), but often there are no
available results for their mixing time. Only in the case of the Double
Cut-and-Join (DCJ) rearrangement model are a Fully Polynomial time Randomized
Approximation Scheme (FPRAS) and a Fully Polynomial Almost Uniform
Sampler (FPAUS) available for counting and sampling most parsimonious
rearrangement scenarios between two genomes (Miklos and Tannier, 2012).
But this is hardly generalizable to more than two genomes because for DCJ
the median problem is NP-hard (Tannier et al., 2009).



Recently, a simpler rearrangement model has been published by Feijao
and Meidanis (2011) under the name Single Cut or Join, or SCJ. It consists
in a gain and loss process on gene adjacencies, and from a chromosomal point
of view, allows fusions and fissions, linearization of circular chromosomes
and vice versa. The computational simplicity of this model is highlighted by
the existence of an easy polynomial running time algorithm for the median
problem. More generally, finding a most parsimonious SCJ scenario on an
arbitrary evolutionary tree (the small parsimony problem) is also polynomial.

Therefore, it is reasonable to assume that at least stochastic approxima- 
tions are available for the number of most parsimonious SCJ scenarios. We 
show here that it is the case for two genomes. However, we report a negative 
result for the small parsimony problem: the number of most parsimonious 
SCJ scenarios cannot be approximated in polynomial time even in a stochas- 
tic manner unless RP = NP. This bounds the possibilities of using this 
model for genomic studies. 

The paper is organized as follows. The next section formally introduces
useful vocabulary in genome rearrangement and in the complexity of randomized
algorithms. In Section 3 we show that counting and sampling SCJ scenarios
between two genomes is easy, and show the relation with the so-called Andre's
problem on alternating permutations. The hardness theorems for an arbitrary
number of genomes are stated and proved in Section 4. The paper ends with
a discussion on the impact of these results and the statements of some related
open problems.

2. Genome rearrangement: finding, counting, sampling 

2.1. Genome rearrangement by SCJ 

Definition 1. A genome is a directed, edge-labelled graph, in which each
vertex has a total degree at most 2, and each label is unique. Each edge is
called a gene. The beginning of an edge is called tail, the end of an edge is
called head, the joint name of heads and tails is extremities. The vertices
with degree 2 are called adjacencies, the vertices with degree 1 are called
telomeres.

By definition, a genome is a set of disjoint paths and cycles, and neither 
the paths nor the cycles are necessarily directed. The components of the 
genome are the chromosomes. An example of two genomes is shown
in Figure 1. All adjacencies correspond to two gene extremities and telomeres
to one. For example, (h1,t3) describes the vertex of genome $G_2$ in Figure 1
in which the head of gene 1 and the tail of gene 3 meet, and similarly, (h7)
is the telomere where gene 7 ends. A genome is fully described by a list of
such descriptions of adjacencies and telomeres.

We will study several genomes simultaneously. We always assume the 
genomes we compare have the same label set. It means they are required to 
have exactly the same gene content. 



Figure 1: An example of two genomes, $G_1$ and $G_2$, with 9 genes.

Definition 2. A Single Cut or Join (SCJ) operation transforms one genome
into another by modifying the adjacencies and telomeres in one of the
following 2 ways:

• take an adjacency (a, b) and replace it by two telomeres, (a) and (b).

• take two telomeres (a) and (b), and replace them by an adjacency (a, b).

Given two genomes $G_1$ and $G_2$, it is always possible to transform one into
the other by a sequence of SCJ operations (Feijao and Meidanis, 2011). Such
a sequence is called an SCJ scenario for $G_1$ and $G_2$. Scenarios of minimum
length are called most parsimonious, and their length is the SCJ distance
and is denoted by $d_{SCJ}(G_1, G_2)$.
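The two SCJ operations of Definition 2 can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text: a genome is stored as a frozenset of adjacencies, each adjacency is a frozenset of two extremity names such as "h1" or "t3", and telomeres are left implicit (every extremity that is in no adjacency). The helper names `cut`, `join` and `extremities_in_use` are hypothetical.

```python
def extremities_in_use(adjacencies):
    """Extremities that sit in some adjacency; all others are telomeres."""
    return {e for adj in adjacencies for e in adj}


def cut(adjacencies, a, b):
    """Replace the adjacency (a, b) by the two telomeres (a) and (b)."""
    adj = frozenset((a, b))
    if adj not in adjacencies:
        raise ValueError(f"adjacency ({a},{b}) is not present")
    return adjacencies - {adj}


def join(adjacencies, a, b):
    """Replace the two telomeres (a) and (b) by the adjacency (a, b)."""
    used = extremities_in_use(adjacencies)
    if a == b or a in used or b in used:
        raise ValueError("both extremities must be distinct telomeres")
    return adjacencies | {frozenset((a, b))}


# A toy genome on genes 1, 2, 3 (not the genome of Figure 1).
genome = frozenset({frozenset(("h1", "t2")), frozenset(("h2", "t3"))})
after_cut = cut(genome, "h1", "t2")          # h1 and t2 become telomeres
after_join = join(after_cut, "h1", "h3")     # two telomeres are joined
```

Storing only the adjacencies is enough here because all compared genomes share the same gene content, so the telomeres can always be recovered from the label set.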



The adjacency graph was introduced by Bergeron et al. (2006) to compute
the DCJ distance between two genomes. It can be used to study SCJ
scenarios as well:

Definition 3. The adjacency graph $G(V_1 \cup V_2, E)$ of two genomes $G_1$ and
$G_2$ is a bipartite multigraph in which $V_1$ is the set of adjacencies and
telomeres of $G_1$ and $V_2$ is the set of adjacencies and telomeres of $G_2$.
The number of edges between $u \in V_1$ and $v \in V_2$ is the number of
extremities they share.

Each vertex of the adjacency graph has either degree 1 or 2, and thus, 
the adjacency graph falls into disjoint cycles and paths. Each path has one 
of the following three types: 

• odd path, containing an odd number of edges and an even number of
vertices,

• W-shaped path, which is an even path with two endpoints in $V_1$,

• M-shaped path, which is an even path with two endpoints in $V_2$.

Figure 2: The adjacency graph of the two genomes of Figure 1.

In addition we call trivial components the cycles with two edges and the
paths with one edge. An example of an adjacency graph can be seen in Figure 2.

2.2. Counting and Sampling SCJ scenarios 

Definition 4. A decision problem is in NP if a non-deterministic Turing
Machine can solve it in polynomial time. An equivalent definition is that a
witness proving the "yes" answer to the question can be verified in polynomial
time. A counting problem is in #P if it asks for the number of witnesses of
a problem in NP.

Definition 5. A decision problem is in RP if a randomized algorithm exists
with the following properties: a) the running time is deterministic and grows
polynomially with the size of the input, b) if the true answer is "no", then
the algorithm answers "no" with probability 1, c) if the true answer is "yes",
then it answers "yes" with probability at least 1/2.

Definition 6. The Most Parsimonious SCJ scenario problem (MPSCJ) is
to compute $d_{SCJ}(G_1, G_2)$ for two genomes $G_1$ and $G_2$ given as input.
The #MPSCJ problem asks for the number of scenarios of length
$d_{SCJ}(G_1, G_2)$, denoted by $\#MPSCJ(G_1, G_2)$.

For example, the SCJ distance between the two genomes of Figure 1 is
12 and there are $16 \times \binom{12}{3,3,4,2}$ different scenarios.

MPSCJ is an optimization problem, which has a natural corresponding
decision problem asking if there is a scenario with a given number of SCJ
operations. So we may write that #MPSCJ $\in$ #P, which means that #MPSCJ
asks for the number of witnesses of the decision problem "Is there a scenario
for $G_1$ and $G_2$ of size $d_{SCJ}(G_1, G_2)$?".



Definition 7. Given a rooted binary tree $T(V, E)$ with $k$ leaves, and genomes
$G_1, G_2, \ldots, G_k$ assigned to the leaves, the small parsimony SCJ problem
(SPSCJ) asks for an assignment of genomes to the internal nodes of $T$ and an
SCJ scenario for each edge, which minimize the number of SCJ operations along
the tree, i.e.,

$$\sum_{(v_i, v_j) \in E} d_{SCJ}(G_i, G_j) \tag{1}$$

where $G_i$ ($G_j$) is the genome which is assigned to vertex $v_i \in V$
($v_j \in V$).

The small parsimony term is borrowed from the well-known textbook problem
(Jones and Pevzner, 2004), the small parsimony problem of discrete
characters: Given a rooted binary tree $T(V, E)$ with $k$ leaves labelled by
characters from a finite alphabet, label the internal nodes such that the
number of edges labelled with different characters at their two ends is
minimized.

The solution space of the most parsimonious SCJ scenarios on a tree
consists of all possible combinations of assignments to the internal nodes
together with the possible SCJ scenarios on the edges of the phylogenetic tree.
The #SPSCJ problem asks the size of this solution space.

As the decision version of SPSCJ is trivially in NP, #SPSCJ is in #P.
There are subclasses in #P containing counting problems which are
approximable by polynomial deterministic or randomized algorithms.

Definition 8. A counting problem in #P is in FP if there is a polynomial
running time algorithm which gives the solution. It is #P-complete if any
problem in #P can be reduced to it by a polynomial-time counting reduction.

Definition 9. A counting problem in #P is in FPRAS (Fully Polynomial
Randomized Approximation Scheme) if there exists a randomized algorithm
such that for any instance $x$, and $\epsilon, \delta > 0$, it generates an
approximation $\hat{f}$ for the solution $f$, satisfying

$$P\left(\frac{f}{1+\epsilon} \le \hat{f} \le f(1+\epsilon)\right) \ge 1 - \delta \tag{2}$$

and the algorithm has a time complexity bounded by a polynomial of $|x|$,
$1/\epsilon$ and $-\log(\delta)$.

The total variational distance $d_{TV}(p, \pi)$ between two discrete
distributions $p$ and $\pi$ over the set $X$ is defined as

$$d_{TV}(p, \pi) := \frac{1}{2} \sum_{x \in X} |p(x) - \pi(x)| \tag{3}$$
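As a small illustration of this distance, the following sketch computes it for two distributions given as dictionaries. The dictionary representation and the function name `total_variation` are assumptions made for the example; outcomes missing from a dictionary are treated as having probability 0.

```python
def total_variation(p, q):
    """Half the L1 distance between two discrete distributions.

    p and q map outcomes to probabilities; missing outcomes count as 0.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


uniform = {"s1": 0.5, "s2": 0.5}     # uniform over two solutions
biased = {"s1": 1.0}                 # a sampler that always returns s1
distance = total_variation(uniform, biased)
```

With this definition the distance lies between 0 (identical distributions) and 1 (disjoint supports).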



Definition 10. A counting problem in #P is in FPAUS if there exists a
randomized algorithm (a Fully Polynomial Almost Uniform Sampler, which is
also abbreviated as FPAUS) such that for any instance $x$, and $\epsilon > 0$, it
generates a random element of the solution space following a distribution $p$
satisfying

$$d_{TV}(p, U) \le \epsilon \tag{4}$$

where $U$ is the uniform distribution over the solution space, and the algorithm
has a time complexity bounded by a polynomial of $|x|$ and $-\log(\epsilon)$.

3. Most parsimonious SCJ scenarios between two genomes 

3.1. A dynamic programming solution 

The SCJ distance can be calculated in polynomial time, as stated in the
following theorem.

Theorem 11. (Feijao and Meidanis (2011)) Let $\Pi_1$ denote the set of
adjacencies in genome $G_1$ and let $\Pi_2$ denote the set of adjacencies in
genome $G_2$. Then

$$d_{SCJ}(G_1, G_2) = |\Pi_1 \Delta \Pi_2| \tag{5}$$

where $\Delta$ denotes the symmetric difference of the two sets.



Theorem 11 says that any shortest path transforming $G_1$ into $G_2$ has to
cut all the adjacencies in $G_1 \setminus G_2$ and add all the adjacencies in
$G_2 \setminus G_1$, and there are no more SCJ operations. Drawing one solution
is easy: first cut all adjacencies in $G_1 \setminus G_2$, then join all
adjacencies in $G_2 \setminus G_1$. But if we want to explore the solution
space, we have to observe that if an adjacency (a, b) exists in
$G_1 \setminus G_2$ and an adjacency (a, c) exists in $G_2 \setminus G_1$, then
first adjacency (a, b) must be cut to create telomere (a), and then telomere (a)
can be connected to telomere (c). Similarly, if extremity c belongs to an
adjacency in $G_1 \setminus G_2$, then it must also be cut before connecting the
two telomeres. Therefore there are restrictions on the order of cuts and joins.
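The distance formula and the "cut everything, then join everything" scenario can be sketched as follows. The representation (a genome given by its adjacency set, each adjacency a frozenset of two extremity names) and the function names are assumptions of this sketch, and the two toy genomes are not those of Figure 1.

```python
def adjacency_set(pairs):
    """Build an adjacency set from an iterable of extremity pairs."""
    return frozenset(frozenset(p) for p in pairs)


def scj_distance(pi1, pi2):
    """SCJ distance: size of the symmetric difference of the adjacency sets."""
    return len(pi1 ^ pi2)


def cuts_then_joins(pi1, pi2):
    """One most parsimonious scenario: all cuts first, then all joins."""
    cuts = sorted(("cut", tuple(sorted(adj))) for adj in pi1 - pi2)
    joins = sorted(("join", tuple(sorted(adj))) for adj in pi2 - pi1)
    return cuts + joins


pi1 = adjacency_set([("h1", "t2"), ("h2", "t3")])
pi2 = adjacency_set([("h1", "t3")])
distance = scj_distance(pi1, pi2)
scenario = cuts_then_joins(pi1, pi2)
```

Performing every cut before any join is always valid, because after the cuts all extremities involved in the joins are telomeres. Other orders are valid only when each join waits for the cuts that free its two extremities, which is the restriction discussed above.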

The allowed order of cuts and joins can be read from the adjacency graph:
When an SCJ operation acts on $G_1$ and thus creates $G_1'$, it also acts on the
adjacency graph of $G_1$ and $G_2$ by transforming it into the adjacency graph
of $G_1'$ and $G_2$. Therefore the transformation of $G_1$ into $G_2$ can be seen
as a transformation of the adjacency graph into trivial components. We say
that an SCJ scenario sorts the adjacency graph if it transforms it into trivial
components. As any SCJ operation in a most parsimonious scenario acts on
a single component, we say that the set of SCJ operations acting on that
component sort it if they transform it into trivial components.

We first show how to compute the number of scenarios for sorting
one component. Then the number of scenarios for several components will
be deduced by a combination of scenarios from each component.

Let $W(i)$ (respectively $M(i)$, $O(i)$ and $C(i)$) denote the number of most
parsimonious SCJ scenarios sorting a W-shaped path (respectively M-shaped
path, odd path, cycle) with $i$ adjacencies in $G_1$. The following dynamic
programming algorithm computes all these numbers.

For a trivial component, no SCJ operation is needed so there is only one
solution: the empty sequence. This gives

$$C(1) = 1 \tag{6}$$

$$O(0) = 1 \tag{7}$$

The smallest W-shaped path has no adjacency in $G_1$ and one in $G_2$. There
is a unique solution sorting it: add the adjacency. This gives

$$W(0) = 1 \tag{8}$$

A scenario of any other component starts with cutting an adjacency in $G_1$.
For a W-shaped path, this results in two W-shaped paths. For an M-shaped
path, this results in two odd paths. For an odd path, this results in an odd
path and a W-shaped path. For a cycle, this results in a W-shaped path.
Each emerging component has fewer adjacencies in $G_1$, and hence, a dynamic
programming recursion can be applied: the resulting components must be
sorted and in case of two resulting components, the sorting steps on the
components must be merged. Hence the dynamic programming recursions
are

$$C(i) = i \times W(i-1) \tag{9}$$

$$W(i) = \sum_{j=1}^{i} \binom{2i}{2j-1} W(j-1)\, W(i-j) \tag{10}$$

$$M(i) = \sum_{j=1}^{i} \binom{2i-2}{2j-2} O(j-1)\, O(i-j) \tag{11}$$

$$O(i) = \sum_{j=1}^{i} \binom{2i-1}{2j-1} W(j-1)\, O(i-j) \tag{12}$$

In each sum, $j$ is the position of the adjacency of $G_1$ that is cut first,
and the binomial coefficient counts the ways of merging the sorting steps of
the two resulting components.



These dynamic programming recursions can be used for counting and
sampling by the classical Forward-Backward phases: in the Forward phase the
number of solutions is calculated, and in the Backward phase one random
solution is chosen based on the numbers in the sums.

So it is possible to compute $W(i)$, $M(i)$, $O(i)$ and $C(i)$ in polynomial
time and to sample one scenario from the uniform distribution. We can then
count and sample for several components by adding a multinomial coefficient.
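The Forward (counting) phase can be sketched with memoised recursion. The function names follow the text; the recursion used for odd paths is written out here from the splitting rule stated above (cutting an odd path gives a W-shaped path and an odd path), and the binomial coefficients count the interleavings of the two resulting sub-scenarios. This is a sketch of the counting step only, not of the Backward sampling phase, and it assumes Python 3.8 or later for `math.comb`.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def W(i):
    """Scenarios sorting a W-shaped path with i adjacencies in G1."""
    if i == 0:
        return 1
    return sum(comb(2 * i, 2 * j - 1) * W(j - 1) * W(i - j)
               for j in range(1, i + 1))


@lru_cache(maxsize=None)
def O(i):
    """Scenarios sorting an odd path with i adjacencies in G1."""
    if i == 0:
        return 1
    return sum(comb(2 * i - 1, 2 * j - 1) * W(j - 1) * O(i - j)
               for j in range(1, i + 1))


def M(i):
    """Scenarios sorting an M-shaped path with i >= 1 adjacencies in G1."""
    return sum(comb(2 * i - 2, 2 * j - 2) * O(j - 1) * O(i - j)
               for j in range(1, i + 1))


def C(i):
    """Scenarios sorting a cycle with i >= 1 adjacencies in G1."""
    return 1 if i == 1 else i * W(i - 1)


table = {
    "W": [W(i) for i in range(4)],
    "O": [O(i) for i in range(4)],
    "M": [M(i) for i in range(1, 4)],
    "C": [C(i) for i in range(1, 4)],
}
```

Each value needs a sum of at most i terms over previously computed values, so filling the table up to size n takes a quadratic number of arithmetic operations on (large) integers.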

Theorem 12. Let $G_1$ and $G_2$ be two genomes with adjacency graph $AG$.
Assume $AG$ contains $i$ M-shaped paths, with respectively
$m_1, m_2, \ldots, m_i$ adjacencies in $G_1$; $AG$ contains $j$ W-shaped paths,
with respectively $w_1, w_2, \ldots, w_j$ adjacencies in $G_1$; $AG$ contains $k$
odd paths, with respectively $v_1, v_2, \ldots, v_k$ adjacencies in $G_1$; and
$AG$ contains $l$ cycles, with respectively $c_1, c_2, \ldots, c_l$ adjacencies
in $G_1$. The number of most parsimonious SCJ scenarios from $G_1$ to $G_2$ is

$$\frac{\left(\sum_{n=1}^{i} (2m_n - 1) + \sum_{n=1}^{j} (2w_n + 1) + \sum_{n=1}^{k} 2v_n + \sum_{n=1}^{l} 2c_n\right)!}{\prod_{n=1}^{i} (2m_n - 1)! \, \prod_{n=1}^{j} (2w_n + 1)! \, \prod_{n=1}^{k} (2v_n)! \, \prod_{n=1}^{l} (2c_n)!} \times \prod_{n=1}^{i} M(m_n) \prod_{n=1}^{j} W(w_n) \prod_{n=1}^{k} O(v_n) \prod_{n=1}^{l} C(c_n) \tag{13}$$
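The structure of this count, a multinomial coefficient for merging the per-component scenarios times the product of the per-component counts, can be checked numerically. The sketch below takes the number of SCJ operations needed by each component and the product of the per-component scenario counts as inputs; the function name `merged_scenario_count` is hypothetical. The example values (components needing 3, 3, 4 and 2 operations, product 16) are those of the Figure 1 example quoted earlier in the text.

```python
from math import factorial


def merged_scenario_count(operations_per_component, product_of_counts):
    """Multinomial coefficient times the product of per-component counts."""
    total = factorial(sum(operations_per_component))
    for n in operations_per_component:
        total //= factorial(n)       # exact: partial quotients stay integers
    return total * product_of_counts


example = merged_scenario_count([3, 3, 4, 2], 16)
```

For these inputs the multinomial coefficient is 12!/(3! 3! 4! 2!) = 277200, so the total is 16 x 277200 = 4435200.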

n=l n=l n=l n=l 

Sampling a scenario from the uniform distribution is then achieved by
generating a random permutation of elements carrying colours and indices, one
colour for each component, and then erasing the indices to get a permutation
with repeats. For each component, its sorting steps are then placed, in order,
at the positions of the joint scenario that carry the colour of the component.
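A minimal sketch of this merging step is given below, assuming that one scenario has already been sampled uniformly for each component (for instance by the Backward phase). Shuffling a list in which each component label appears once per sorting step plays the role of the coloured permutation with erased indices. The function name `merge_uniformly` and the list-of-lists input format are assumptions of the sketch.

```python
import random


def merge_uniformly(component_scenarios, rng=random):
    """Interleave per-component scenarios uniformly at random.

    component_scenarios is a list of lists; the relative order of the
    steps inside each component is preserved.
    """
    colours = [c for c, steps in enumerate(component_scenarios) for _ in steps]
    rng.shuffle(colours)                     # uniform permutation with repeats
    cursor = [0] * len(component_scenarios)
    merged = []
    for c in colours:
        merged.append(component_scenarios[c][cursor[c]])
        cursor[c] += 1
    return merged


scenarios = [["a1", "a2", "a3"], ["b1", "b2"]]
joint = merge_uniformly(scenarios, random.Random(0))
```

Every arrangement of the colours is the image of the same number of indexed permutations, so each interleaving is produced with equal probability.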

We can then state the following theorem settling the complexity of the 
comparison of two genomes by SCJ. 

Theorem 13. #MPSCJ is in FP and there is a polynomial algorithm sam- 
pling from the exact uniform distribution of the solution space of an MPSCJ 
problem. 

3.2. Alternating permutations 

The solutions to #MPSCJ for single components are also linked to the
number of alternating permutations, for which finding a formula is an old
open problem. An alternating permutation of size $n$ is a permutation
$c_1, \ldots, c_n$ of $\{1, \ldots, n\}$ such that $c_{2i-1} < c_{2i}$ and
$c_{2i} > c_{2i+1}$ for all $i$ (Andre, 1881). For example, if $n = 4$, the
permutation 1, 3, 2, 4 is an alternating permutation but 1, 3, 4, 2 is not
because 3 is less than 4. The number of alternating permutations of size $n$ is
denoted by $A_n$ and finding these numbers is known as Andre's problem.
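For small sizes the numbers $A_n$ can be obtained by brute force directly from the definition. The sketch below enumerates all permutations, so it is only meant as a check for small $n$; the function names are hypothetical.

```python
from itertools import permutations


def is_alternating(c):
    """True if c[0] < c[1] > c[2] < c[3] > ... (0-based indexing)."""
    return all(
        (c[k] < c[k + 1]) if k % 2 == 0 else (c[k] > c[k + 1])
        for k in range(len(c) - 1)
    )


def count_alternating(n):
    """Number of alternating permutations of {1, ..., n} by enumeration."""
    return sum(is_alternating(p) for p in permutations(range(1, n + 1)))


first_values = [count_alternating(n) for n in range(1, 7)]
```

The enumeration visits n! permutations, so it quickly becomes impractical; it is useful mainly for checking small cases of the counts discussed in this section.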

We show that computing SCJ scenarios is closely related: 

Theorem 14.

$$M(k) = A_{2k-1}$$

$$W(k) = A_{2k+1}$$

$$O(k) = A_{2k}$$

$$C(k) = k \times A_{2k-1}$$

Proof. We prove only the first line, the second and the third lines can be
proved the same way. The proof of the last line comes from the fact that a
cycle with $k$ adjacencies can be opened in $k$ different ways into a W-shaped
component with $k - 1$ adjacencies. Let the adjacencies in the $G_1$ part of
the M-shaped component be $(x_1, x_2), (x_3, x_4), \ldots, (x_{2k-1}, x_{2k})$.
Any SCJ scenario sorting these must cut all these adjacencies and must create
adjacencies $(x_2, x_3), (x_4, x_5), \ldots, (x_{2k-2}, x_{2k-1})$. Let us index
the SCJ operations in a scenario, and let $\pi_{2i-1}$ be the index of the SCJ
step which cuts the adjacency $(x_{2i-1}, x_{2i})$, and let $\pi_{2i}$ be the
index of the SCJ step which joins $x_{2i}$ and $x_{2i+1}$.

In any most parsimonious SCJ scenario sorting the M-shaped component,
$\pi_{2i-1} < \pi_{2i}$ and $\pi_{2i+1} < \pi_{2i}$, so $\pi$ is an alternating
permutation. Hence the number of sorting scenarios is at most $A_{2k-1}$.

On the other hand, for any alternating permutation of size $2k - 1$, we can
construct a sorting scenario in which the indexes come from the alternating
permutation. Since the sorting scenarios for different alternating
permutations are different, the number of SCJ scenarios is at least $A_{2k-1}$. □

4. Counting and sampling SCJ small parsimony solutions 

The SPSCJ problem is in P, since one optimal assignment of genomes
to the internal nodes can be drawn in polynomial running time (Feijao and
Meidanis, 2011). However, we show that estimating the size of the solution
space, as well as uniformly sampling it, is hard.

We show first that there is no polynomial running time algorithm which
samples almost uniformly from the solutions unless RP = NP:

Theorem 15. #SPSCJ $\in$ FPAUS $\Rightarrow$ RP = NP.

Then our conjecture is that #SPSCJ is #P-complete, but we can prove
only a slightly weaker result:

Theorem 16. #SPSCJ $\in$ FP $\Rightarrow$ P = NP.

Stochastic counting (FPRAS) and sampling (FPAUS) are equivalent for
self-reducible problems (Jerrum et al., 1986; see the quite technical
definition of self-reducibility there). However, the counting counterpart of
Theorem 15 cannot be immediately deduced from it because we lack a proof of
self-reducibility for #SPSCJ, which seems far from trivial and may not even
hold in this case. So we have to prove this counting counterpart independently.



The construction we use in the proof of Theorem 15 shows the hardness
of a more specific problem and can be adapted to prove that:

Theorem 17. #SPSCJ $\in$ FPRAS $\Rightarrow$ RP = NP.

We first recall in the following subsection how to draw one particular
solution and then how to build all possible solutions. Then we show how
to generate an RP algorithm for 3SAT using an FPAUS algorithm for the
#SPSCJ problem. Since 3SAT is NP-complete, this construction proves
Theorem 15. This section finishes with proving Theorems 16 and 17.



4.1. The Fitch and Sankoff solutions

Let $\Pi_1, \Pi_2, \ldots, \Pi_k$ denote the adjacency sets of genomes
$G_1, G_2, \ldots, G_k$ and let

$$\Pi = \bigcup_{i=1}^{k} \Pi_i. \tag{14}$$



Feijao and Meidanis (2011) proved that the parsimony score is equal to
the sum of the scores for each particular adjacency $a \in \Pi$. This can be
computed by solving the small parsimony problem for a discrete character.
Although this is mainly textbook material, we recall the principles of the
standard algorithms solving this problem for one adjacency because some
stages will be referred to in the hardness proof. For one adjacency, the small
parsimony problem is solved by Fitch's algorithm (Fitch, 1971). Its principle
is first to assign sets ({0}, {1} or {0, 1}) to every node of the tree, visiting
the nodes of the tree in post-order traversal, i.e., first the leaves of the tree
and then the parents of each node. At the leaves of the tree, {0}s and
{1}s are assigned according to the pattern of presence or absence of $a$ in
the corresponding genomes. Let $B(a, u)$ denote the set assigned to node $u$
regarding adjacency $a$. Fitch's algorithm applies the recursion

$$B(a, u) = \begin{cases} B(a, v_1) \cap B(a, v_2) & \text{if } B(a, v_1) \cap B(a, v_2) \ne \emptyset \\ B(a, v_1) \cup B(a, v_2) & \text{otherwise} \end{cases} \tag{15}$$

where $v_1$ and $v_2$ are the children of $u$.

Definition 18. We say that there is an ambiguity for an adjacency a at 
vertex u if B(a,u) = {0,1}. 

Then starting from the root, the nodes are visited in a pre-order traversal,
and {0} or {1} is assigned to each node according to the following rules: If
$B(a, root)$ contains only one element, then it is assigned to the root. If
$B(a, root) = \{0, 1\}$, then any of them can be chosen for the root. Once
the number assigned to the root is fixed, the values are propagated down.
Let $F(a, v)$ denote the singleton set assigned to the node $v$ for adjacency $a$.
Fitch's algorithm applies the recursion:

$$F(a, v) = \begin{cases} F(a, u) \cap B(a, v) & \text{if } F(a, u) \cap B(a, v) \ne \emptyset \\ B(a, v) & \text{otherwise} \end{cases} \tag{16}$$

where $v$ is a child of $u$.
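The two passes of Fitch's algorithm for a single adjacency can be sketched as follows. The tree encoding is an assumption of this sketch: a leaf is the integer 0 or 1 (absence or presence of the adjacency) and an internal node is a pair of subtrees. The function names and the `root_choice` parameter, which resolves an ambiguity at the root, are hypothetical.

```python
def bottom_up(node):
    """Post-order pass: compute the set B for every node of the tree."""
    if isinstance(node, int):
        return {"B": frozenset({node}), "children": []}
    left, right = (bottom_up(child) for child in node)
    common = left["B"] & right["B"]
    return {"B": common if common else left["B"] | right["B"],
            "children": [left, right]}


def top_down(annotated, parent_value=None, root_choice=0):
    """Pre-order pass: store the chosen value F at every node."""
    B = annotated["B"]
    if parent_value is None:                 # the root
        value = next(iter(B)) if len(B) == 1 else root_choice
    elif parent_value in B:                  # F(u) intersects B(v)
        value = parent_value
    else:                                    # B(v) is a singleton here
        value = next(iter(B))
    annotated["F"] = value
    for child in annotated["children"]:
        top_down(child, value)


def changes(annotated):
    """Number of edges whose two ends carry different values of F."""
    return sum((child["F"] != annotated["F"]) + changes(child)
               for child in annotated["children"])


tree = bottom_up(((1, 0), (1, 1)))
top_down(tree)
score = changes(tree)
```

In this toy tree the left cherry is ambiguous, B = {0, 1}, but the root is not, so the top-down pass resolves the ambiguity with the value inherited from the root.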

$F(a, v)$ then always contains exactly one element. Doing this independently
for all adjacencies does not guarantee that the collection of present
adjacencies at each node is a genome: we call a subset $S \subseteq \Pi$ a valid
genome if there is no couple of adjacencies $a_1, a_2 \in S$ with a common
extremity. Feijao and Meidanis (2011) showed that if the assignments of
$F(a, root)$ over all possible adjacencies $a \in \Pi$ are chosen to be a valid
genome, then all genomes at the internal nodes are also valid (deduced from
Lemmas 6.1 and 6.2 in Feijao and Meidanis (2011)). They also proved that at
least one valid assignment exists since Fitch's algorithm never gives
non-ambiguous values for adjacencies sharing extremities.

We call Fitch solutions the genome assignments constructed this way.
However, they are not the only possible most parsimonious genome
assignments. Some of them cannot be found by Fitch's algorithm. All solutions
can be found by a generalization of Fitch's algorithm, Sankoff's algorithm
(Sankoff and Rousseau, 1975). It is a dynamic programming principle which
computes two values for each node of the phylogenetic tree: for a leaf $v_i$
assigned with genome $G_i$,

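$$s1(a, v_i) = \begin{cases} 0 & \text{if } a \in \Pi_i \\ \infty & \text{otherwise} \end{cases} \tag{17}$$

$$s0(a, v_i) = \begin{cases} 0 & \text{if } a \notin \Pi_i \\ \infty & \text{otherwise} \end{cases} \tag{18}$$

and for an internal node $u$ with children $v_1$ and $v_2$,

$$s1(a, u) = \min\{s1(a, v_1), s0(a, v_1) + 1\} + \min\{s1(a, v_2), s0(a, v_2) + 1\} \tag{19}$$

$$s0(a, u) = \min\{s0(a, v_1), s1(a, v_1) + 1\} + \min\{s0(a, v_2), s1(a, v_2) + 1\} \tag{20}$$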

The value of $s0(a, u)$ (respectively $s1(a, u)$) represents the minimum
number of edges under the subtree rooted at $u$ which are labelled with
different presence/absence of $a$ at their two ends in a most parsimonious
scenario, given that $u$ is labelled with the absence (respectively presence) of
$a$. Then $\min(s1(a, root), s0(a, root))$ is the minimum small parsimony
solution for adjacency $a$, and the assignments to internal nodes are obtained
by propagating down the values based on which gave the minimum in Equations 19
and 20.
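The bottom-up part of Sankoff's dynamic programming for one adjacency can be sketched as below. As in the previous illustration, the tree encoding is an assumption: a leaf is 0 or 1 and an internal node is a pair of subtrees. The function returns the pair (s0, s1) for the root of the given subtree; the name `sankoff` is hypothetical and the traceback that enumerates assignments is not shown.

```python
import math


def sankoff(node):
    """Return (s0, s1) for the subtree rooted at node, for one adjacency."""
    if isinstance(node, int):
        # A leaf costs 0 for its observed state and infinity for the other.
        return (0, math.inf) if node == 0 else (math.inf, 0)
    (l0, l1), (r0, r1) = sankoff(node[0]), sankoff(node[1])
    s0 = min(l0, l1 + 1) + min(r0, r1 + 1)
    s1 = min(l1, l0 + 1) + min(r1, r0 + 1)
    return s0, s1


s0_root, s1_root = sankoff(((1, 0), (1, 1)))
parsimony_score = min(s0_root, s1_root)
```

On the tree ((1, 0), (1, 1)) the root values are s0 = 2 and s1 = 1, so the adjacency is present at the root in every most parsimonious labelling and the score is 1.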



Contrary to Fitch's algorithm, this one explores all possible most
parsimonious assignments for a given adjacency (Erdos and Szekely, 1994).
Unfortunately, in that case there is no guarantee that all of these assignments
give valid genomes, as the result of Feijao and Meidanis holds only for Fitch
solutions. It is an open question how to estimate the number of most
parsimonious genome assignments (we can call them the Sankoff solutions), and
it is beyond the scope of this paper (note that it is a different problem from
#SPSCJ, where we aim at estimating the number of SCJ scenarios and not
only genome assignments).

4.2. Sampling most parsimonious SPSCJ scenarios is hard

In this section we construct a problem instance $x \in$ SPSCJ for any 3CNF
formula $\Phi$ with $n$ variables, such that an FPAUS for $x$ yields
an RP algorithm for deciding whether or not $\Phi$ is satisfiable.

Let $\Phi$ be a 3CNF with $n$ logical variables and $k$ clauses. We are going
to construct a tree denoted by $T_\Phi$, and label its leaves with genomes. For
each logical variable $b_i$ we create an adjacency $\alpha_i$. In this
construction, all adjacencies are independent of one another, namely they never
share common extremities. So there is no genome validity issue in this
construction: any assignment of adjacency presence/absence is a valid genome.

For each clause $c_j$, we construct a subtree $T_{c_j}$. The construction is
done in three phases, see also Figure 3. First, we create a constant size
subtree, called unit subtree, using building blocks we call elementary subtrees.
Then in the blowing up phase, this unit subtree is repeated several times, and
in the third phase it is amended with another constant size subtree. The
reason for this construction is the following: the unit subtree is constructed
in such a way that if a clause is satisfied, the number of SCJ solutions is
a greater number, and is always the same number regardless of how
many literals satisfy the clause. When the clause is not
satisfied, the number of SCJ solutions is a smaller number. The blowing up
is necessary for sufficiently separating the number of solutions for satisfying
and not satisfying assignments. Finally, the amending is necessary for having
all adjacencies ambiguous in the Fitch solutions.

We detail the construction of the subtree for the clause
$c_j = b_1 \vee b_2 \vee b_3$, denoted by $T_{c_j}$. Subtrees for the other kinds
of clauses are constructed similarly. The unit subtree is built from 76 smaller
subtrees that we will call elementary subtrees. Only 14 different types of
elementary subtrees are in a unit subtree, but several of them have a given
multiplicity, and the total count of them is 76, see also Table 1. Some of the
elementary subtrees are cherry motifs for which we arbitrarily identify a left
and a right leaf. On some of these cherries, we add one or more adjacencies,
called extra adjacencies, which are present on exactly one leaf of the cherry
and absent everywhere else in $T_\Phi$. So the edges connecting these leaves to
the rest of the entire tree $T_\Phi$ will contain one or more additional SCJ
operations in all most parsimonious solutions.

Figure 3: Constructing a subtree $T_{c_j}$ for a clause $c_j$. The subtree is
built in three phases. First, elementary subtrees are connected with a comb to
get a unit subtree. In the second phase the same unit subtree is repeated
several times, 'blowing up' the tree. In the third phase, the blown up tree is
amended with a constant size, depth 3 fully balanced tree. The smaller subtrees
constructed in the previous phase are denoted with a triangle in the next phase.
See also text for details.

A clause contains 3 logical variables, the unit subtree will be such that
for the corresponding adjacencies, Fitch's algorithm assigns an ambiguity at
the root of the subtree $T_{c_j}$, namely

$$B(\alpha_i, root) = \{0, 1\} \tag{21}$$

for each $b_i \in c_j$. The entire tree, $T_\Phi$, will also be such that Sankoff
solutions are all found by Fitch's algorithm, namely, all solutions can be found
by Fitch's algorithm, as we are going to state and prove in Lemma 20. Therefore
there will be 8 possible genome assignments for the unit subtree, related to
the 8 possible assignments of the three logical variables at the root. Let
the presence of the adjacency at the root mean logical true value, and let
absence mean logical false value. The constructed unit subtree will be such
that if the clause is not satisfied, the number of possible SCJ scenarios for the
corresponding assignment on this unit subtree is $2^{136} \times 3^{76}$, and if
the clause is satisfied, then the number of possible SCJ scenarios for each
corresponding assignment is $2^{156} \times 3^{64}$. The ratio of the two numbers
is $2^{20}/3^{12} > 1$. We will denote this number by $\gamma$. This ratio will
be the basis for our proof: any FPAUS will sample the solutions corresponding to
the satisfied clauses more often than the non-satisfied ones because the former
are more numerous. This can be turned into an RP algorithm for 3SAT.
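The arithmetic behind this ratio can be verified exactly with integer and rational arithmetic. The variable names below are chosen for this illustration only.

```python
from fractions import Fraction

# Scenario counts on one unit subtree, as stated in the text.
unsatisfied = 2**136 * 3**76
satisfied = 2**156 * 3**64

# Exact ratio of the two counts.
gamma = Fraction(satisfied, unsatisfied)
gamma_as_float = float(gamma)
```

The exact value is 1048576/531441, slightly below 2, which helps explain why a single unit subtree separates satisfying from non-satisfying assignments only weakly.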

Below we detail the construction of the elementary subtrees and also give 
the number of SCJ solutions on them since the number of solutions on the 
unit subtree is simply the product of these numbers. 

For the adjacencies $\alpha_1$, $\alpha_2$ and $\alpha_3$, the cherries are the
following:

• for the cherries on which the left leaf contains one extra adjacency, the
presence/absence pattern on the left and right leaf is given by

011, 100
101, 010
110, 001
000, 111

The first column shows the presence/absence of the three adjacencies
on the left leaf, the second column shows the presence/absence of the
three adjacencies on the right leaf. Hence, for example, 000 means
that none of the adjacencies is present, 100 means that only the first
adjacency is present. The number of SCJ solutions on one cherry is 24
if the assignment of adjacencies at the root of the cherry is the same as
on the right leaf. Indeed, in that case, 4 SCJ operations are necessary
on the left edge, and they can be performed in any order. If the numbers
of SCJ operations are 3 and 1 respectively on the left and right edges,
or vice versa, the number of solutions is 6. Finally, if both edges have
2 SCJ operations, then the number of solutions is 4.

• There is one cherry without any extra adjacency, and its presence/absence
pattern is 

000, 111 

If the clause is not satisfied, the number of SCJ solutions on this cherry 
is 6; if all logical values are true, the number of SCJ solutions is still 6; 
in any other case, the number of SCJ solutions is 2. 

This elementary subtree is repeated 3 times. 

• Finally, there are cherries with one extra adjacency on each leaf.
These are two different adjacencies, so each of them needs one extra
SCJ operation on its incoming edge. The presence/absence patterns
are

011, 100
101, 010
110, 001

If all SCJ operations due to $\alpha_i$, $i = 1, 2, 3$, fall onto one edge, then the
number of solutions is 24, otherwise the number of solutions is 12.

Each of these elementary subtrees is repeated 15 times.
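
The counts quoted for the three kinds of cherries all follow from one rule stated above: operations on the same edge can be performed in any order, so an edge carrying $m$ operations contributes $m!$ orderings. The following is a minimal sketch of that rule, not code from the paper; the function names and the tuple encoding of presence/absence patterns are assumptions made for illustration.

```python
from itertools import product
from math import factorial


def hamming(a, b):
    """Number of adjacencies on which two presence/absence patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def cherry_scenarios(root, left, right, extra_left=0, extra_right=0):
    """Scenarios on a cherry for a given root assignment.

    Each edge needs one operation per differing adjacency, plus the
    operations for the extra adjacencies of its leaf; m independent
    operations on one edge can be ordered in m! ways.
    """
    ops_left = hamming(root, left) + extra_left
    ops_right = hamming(root, right) + extra_right
    return factorial(ops_left) * factorial(ops_right)


left, right = (0, 1, 1), (1, 0, 0)
for root in product((0, 1), repeat=3):
    print(root,
          cherry_scenarios(root, left, right, extra_left=1),   # one extra on the left leaf
          cherry_scenarios(root, (0, 0, 0), (1, 1, 1)),        # no extra adjacency
          cherry_scenarios(root, left, right, 1, 1))           # one extra on each leaf
```

For the first kind of cherry the values are 24, 6 or 4; for the cherry without extras they are 6 or 2; for the third kind they are 24 or 12, in agreement with the text.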

The remaining elementary subtrees contain 3 cherries connected with a
comb, that is, a completely unbalanced tree; see also Figure 4. For the cherry
at the right end of this elementary subtree, we add one or more adjacencies
that are present on one of the leaves and absent everywhere else in $T_\Phi$. When
there is one extra adjacency on the left leaf, the adjacencies $\alpha_1$, $\alpha_2$ and $\alpha_3$
are assigned the following presences/absences on the three cherries at
the top of the three combs:

011, 000 
101, 000 
110, 000 

Again, the first column shows the assignment for the left leaf, the second
column for the right leaf. The number of SCJ solutions is 6 on this cherry
if the assignment at the root is 0 for both adjacencies which have assignment
1 on the left leaf. In any other case, the number of solutions is 2. Two of
the adjacencies are ambiguous on this cherry, and the third one is 0. On
the remaining two cherries of this elementary subtree, this third adjacency
is present on all leaves, while the other two are made ambiguous in such a
way that any assignment has one SCJ scenario on the remainder of the tree.
We show the solution for the first subtree in Figure 4.

Figure 4: a) A cherry motif, i.e., two leaves connected with an internal node. b) A comb,
i.e., a fully unbalanced tree with 8 leaves. c) A tree with 3 cherry motifs connected with
a comb. The assignments for 4 adjacencies, $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_x$, are shown at the bottom
for each leaf; $\alpha_i$, $i = 1, 2, 3$, are the adjacencies related to the logical variables $b_i$, and $\alpha_x$
is an extra adjacency. Note that Fitch's algorithm gives ambiguity for all adjacencies
at the root of this subtree.

Leaf assignments in panel c), leaves from left to right:

$\alpha_1$: 1 1 1 1 0 0
$\alpha_2$: 1 0 1 0 1 0
$\alpha_3$: 1 1 0 0 1 0
$\alpha_x$: 0 0 0 0 1 0

Each of these elementary subtrees is repeated 3 times.

Finally, there are elementary subtrees in which there is 1 extra adjacency on
the left leaf and 2 extra adjacencies on the right leaf. The assignments are

011, 000
101, 000 
110, 000 



The number of SCJ solutions is 24 on this cherry if both necessary SCJ 
operations fall onto the edge having 2 additional SCJ operations due to the 
extra adjacencies, and 12 in all other cases. 

Each of these elementary subtrees is repeated 5 times.

Table 1: The number of SCJ scenarios on different elementary subtrees of the unit subtree
of the subtree $T_{c_j}$ for clause $c_j = b_1 \vee b_2 \vee b_3$. Columns represent the 14 different types
of components; the topology of the elementary subtree is indicated on the top. The
black dot means extra SCJ operations on the indicated edge; the numbers represent the
presence/absence of adjacencies on the left leaf of a particular cherry, see text for details.
The row starting with # indicates the number of repeats of the elementary subtrees.
Further rows represent the logical true/false values of the $b_i$s; for example, 001 means $b_1$ =
false, $b_2$ = false, $b_3$ = true. The values in the table indicate the number of solutions,
raised to the appropriate power due to multiplicity of the elementary subtrees. It is easy
to check that the product of the numbers in the first line is $2^{136} \times 3^{76}$ and in any other
line is $2^{156} \times 3^{64}$.



In this way, the roots of all 76 elementary subtrees are ambiguous for the 
three adjacencies related to logical variables. We connect the 76 elementary 
subtrees with a comb, and thus, all three adjacencies are ambiguous at the 
root of the entire subtree, which is the unit subtree. If the clause is satisfied,
the number of SCJ scenarios for the corresponding assignment is $2^{156} \times 3^{64}$;
if the clause is not satisfied, the number of SCJ solutions is $2^{136} \times 3^{76}$, as
can be checked in Table 1. The ratio of them is indeed $2^{20}/3^{12} = \gamma$. The
number of leaves on this unit subtree is 248, and 148 additional adjacencies 
are introduced. 
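
The two products can also be recomputed from the description of the elementary subtrees. The sketch below is an illustration rather than part of the proof, and it rests on two assumptions: the four cherries with one extra adjacency on the left leaf are each used once (this is not stated explicitly, but it is the only multiplicity consistent with the total of 76 elementary subtrees), and the part of each comb-shaped subtree outside its right-end cherry contributes exactly one scenario, as stated in the text. All function and variable names are hypothetical.

```python
from itertools import product
from math import factorial

PAIRS = [(0, 1, 1), (1, 0, 1), (1, 1, 0)]


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def complement(pattern):
    return tuple(1 - bit for bit in pattern)


def cherry(root, left, right, extra_left=0, extra_right=0):
    """m operations on an edge can be ordered in m! ways."""
    return (factorial(hamming(root, left) + extra_left)
            * factorial(hamming(root, right) + extra_right))


def comb_cherry(root, left, extra_left, extra_right):
    """Right-end cherry of a comb-shaped subtree (right leaf is 000).

    Only the two adjacencies that are 1 on the left leaf are ambiguous
    here; the third one needs no operation on this cherry.
    """
    sub_root = tuple(root[i] for i, bit in enumerate(left) if bit == 1)
    return cherry(sub_root, (1, 1), (0, 0), extra_left, extra_right)


def unit_subtree_count(root):
    total = 1
    for left in PAIRS + [(0, 0, 0)]:
        # Assumption: each of these four cherries is used once.
        total *= cherry(root, left, complement(left), extra_left=1)
    total *= cherry(root, (0, 0, 0), (1, 1, 1)) ** 3
    for left in PAIRS:
        total *= cherry(root, left, complement(left), 1, 1) ** 15
        total *= comb_cherry(root, left, 1, 0) ** 3
        total *= comb_cherry(root, left, 1, 2) ** 5
    return total


# 4 + 3 + 3 * (15 + 3 + 5) elementary subtrees in total.
assert 4 + 3 + 3 * (15 + 3 + 5) == 76

for root in product((0, 1), repeat=3):
    expected = 2**156 * 3**64 if any(root) else 2**136 * 3**76
    print(root, unit_subtree_count(root) == expected)
```

Under these assumptions every root assignment with at least one true variable gives $2^{156} \times 3^{64}$, and the all-false assignment gives $2^{136} \times 3^{76}$.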

This was the construction of the constant-size unit subtree. In the next
step, we "blow up" the system. A similar blowing up can be found in Jerrum
et al. (1986), in the proof of Theorem 5.1. We repeat the above-described
unit subtree $\lceil (k \log((n-3)!) + n \log 2)/\log \gamma \rceil + 1$ times, and connect all
of them with a comb (completely unbalanced tree). All three adjacencies
representing the three logical variables in the clause are still ambiguous at
the root of this blown-up subtree, and thus, there are still 8 Fitch solutions.
For a solution satisfying the clause, the number of SCJ scenarios on this 
blown up subtree is 



$$X = \left(2^{156} \times 3^{64}\right)^{\left\lceil \frac{k \log((n-3)!) + n \log 2}{\log \gamma} \right\rceil + 1} \tag{22}$$

and the number of scenarios if the clause is not satisfied is

$$Y = \left(2^{136} \times 3^{76}\right)^{\left\lceil \frac{k \log((n-3)!) + n \log 2}{\log \gamma} \right\rceil + 1} \tag{23}$$

Let all adjacencies not participating in the clause be 0 (absent) at all
leaves of this blown-up subtree.
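
The number of repetitions is chosen so that the ratio $X/Y$, a power of $\gamma$, is at least $((n-3)!)^k \times 2^n \times \gamma$. The following sketch checks this with exact integer arithmetic; it is only an illustration, and the function names are hypothetical. The ceiling is computed as the smallest exponent $m_0$ with $\gamma^{m_0} \ge ((n-3)!)^k \times 2^n$, which avoids floating-point logarithms.

```python
from math import factorial

GAMMA_NUM, GAMMA_DEN = 2**20, 3**12   # gamma = 2^20 / 3^12


def repetitions(n, k):
    """ceil((k log((n-3)!) + n log 2) / log gamma) + 1, computed exactly."""
    target = factorial(n - 3) ** k * 2 ** n
    m0, num, den = 0, 1, 1
    while num < target * den:          # loop until gamma^m0 >= target
        num *= GAMMA_NUM
        den *= GAMMA_DEN
        m0 += 1
    return m0 + 1


def ratio_lower_bound_holds(n, k):
    """Check X / Y >= ((n-3)!)^k * 2^n * gamma without any division."""
    m = repetitions(n, k)
    x = (2**156 * 3**64) ** m
    y = (2**136 * 3**76) ** m
    return x * GAMMA_DEN >= y * factorial(n - 3) ** k * 2 ** n * GAMMA_NUM


print(repetitions(6, 2), ratio_lower_bound_holds(6, 2))
```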

We are close to the final subtree $T_{c_j}$ for one clause, $c_j$. In the third phase,
we amend the tree obtained so far with a constant-size subtree. Construct
a fully balanced depth-3 binary tree, on which all 3 adjacencies which are
in the clause are ambiguous at the root without making more than 1 SCJ
scenario on it, similarly to the left part of the tree in Figure 4. All other
adjacencies not participating in the clause are present at all leaves of this
tree.

Here is how to construct $T_{c_j}$ for one clause, $c_j$. Construct an additional
vertex which will be its root. The left child of the root is the blown-up tree,
while its right child is the depth-3 balanced tree. Denote by $T_{c_j}$ this final
tree for one clause $c_j$.

All adjacencies are ambiguous at the root of the subtree $T_{c_j}$, therefore
there are $2^n$ Fitch solutions for the assignments of the internal nodes of $T_{c_j}$.

Lemma 19. For any assignment of the $n$ adjacencies, if the clause $c_j$ is
satisfied, then the number of SCJ scenarios for the corresponding assignment
on $T_{c_j}$ is at least

$$Y \times ((n-3)!)^k \times 2^n \times \gamma \tag{24}$$

and at most

$$Y \times ((n-3)!)^{k+1} \times 2^n \times \gamma \tag{25}$$

If the clause is not satisfied, then the number of SCJ scenarios is at most
$Y \times (n-3)!$.

Proof. The $B$ values of Fitch's algorithm for the $n - 3$ adjacencies not
representing a logical value in the clause $c_j$ are all $\{0\}$ at all the nodes of the
left child of the root, and all $\{1\}$ at all the nodes of the right child of the
root. Therefore in all scenarios there are $n - 3$ cumulated SCJ operations on
the two edges going out of the root. If they are all on one of the edges, the
number of possible SCJ scenarios is $(n-3)!$, and in all other cases it is
smaller, but at least 1. (Actually, the minimum is $(((n-3)/2)!)^2$, but the very
loose lower bound 1 is sufficient for our calculations.) Then if the clause is
satisfied, the number of SCJ scenarios is between $X$ and $X \times (n-3)!$. Note
that

$$X/Y = ((n-3)!)^k \times 2^n \times \gamma,$$

which gives the stated result. If the clause is not satisfied, the number of
SCJ scenarios is at most $Y \times (n-3)!$. □
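
The bounds on the root edges used in this proof can be seen on a small case. If $j$ of the $m = n - 3$ operations fall on one root edge and $m - j$ on the other, the number of orderings is $j! \, (m-j)!$. The sketch below, with a hypothetical function name, lists these values for $m = 6$.

```python
from math import factorial


def split_counts(m):
    """Orderings when j operations are on one root edge and m - j on the other."""
    return [factorial(j) * factorial(m - j) for j in range(m + 1)]


counts = split_counts(6)
print(counts)                     # largest at the ends, smallest in the middle
print(max(counts), min(counts))   # m! and ((m/2)!)^2 for even m
```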

For all $k$ clauses, construct such a subtree and connect all of them with
a comb. This is the final tree $T_\Phi$ for the 3CNF $\Phi$.

All adjacencies corresponding to logical variables are ambiguous at the
root of $T_\Phi$, so there are $2^n$ Fitch solutions. We prove that there is the
same number of Sankoff solutions.

Lemma 20. All adjacency assignments for the SPSCJ problem on tree $T_\Phi$
are Fitch solutions.

Proof. There are two types of adjacencies participating in $T_\Phi$. There are
$n$ of them related to the logical variables in $\Phi$; the other adjacencies are
introduced in the construction and are present on exactly one leaf, absent
everywhere else in $T_\Phi$.

If an adjacency $\alpha_x$ is present only on one leaf, then in any SPSCJ solution
it is created on the edge connecting the leaf to the remaining part of the tree.
This solution is provided by Fitch's algorithm.

The tree is constructed in such a way that for all $\alpha_i$ representing variable
$b_i \in \Phi$,

$$B(\alpha_i, v) = \{0, 1\} \Rightarrow B(\alpha_i, u) = \{0, 1\} \tag{26}$$

where $u$ is the parent of $v$. First observe that

$$B(\alpha_i, v) = \{0, 1\} \Rightarrow s1(\alpha_i, u) = s0(\alpha_i, u) \tag{27}$$

This means that whenever the two children $v_1$ and $v_2$ of a node $u$ are
ambiguous in Fitch's algorithm,

$$s1(\alpha_i, u) = s1(\alpha_i, v_1) + s1(\alpha_i, v_2) \tag{28}$$

$$s0(\alpha_i, u) = s0(\alpha_i, v_1) + s0(\alpha_i, v_2) \tag{29}$$

namely, all Sankoff solutions are Fitch solutions.

Moreover, at any node $u$ where the $B$ value is ambiguous for some
adjacency $\alpha_i$, while it is not ambiguous in the children of $u$, we have

$$s0(\alpha_i, u) = s1(\alpha_i, u) = 1 \tag{30}$$

and here again the Fitch solutions are the same as the Sankoff solutions. □
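
For readers who want to experiment with the quantities in this proof, here is a minimal sketch of the bottom-up passes of Fitch's and Sankoff's algorithms for a single binary character with unit costs. It is a textbook-style illustration under stated assumptions, not the authors' implementation: the nested-tuple tree encoding and the function names are hypothetical, and `sankoff` returns the pair corresponding to $(s0, s1)$.

```python
INF = float("inf")


def fitch(tree):
    """Fitch set B at the root; a tree is a leaf value (0 or 1) or a pair."""
    if isinstance(tree, int):
        return {tree}
    b_left, b_right = fitch(tree[0]), fitch(tree[1])
    common = b_left & b_right
    return common if common else b_left | b_right


def sankoff(tree):
    """(s0, s1): minimum number of changes below the root when the
    root is assigned 0 or 1, respectively."""
    if isinstance(tree, int):
        return (0, INF) if tree == 0 else (INF, 0)
    child_costs = [sankoff(child) for child in tree]
    return tuple(
        sum(min(c[value], c[1 - value] + 1) for c in child_costs)
        for value in (0, 1)
    )


cherry = (0, 1)
print(fitch(cherry), sankoff(cherry))                      # ambiguous, s0 = s1 = 1
print(fitch((cherry, cherry)), sankoff((cherry, cherry)))  # costs add up, as in (28)-(29)
```

On a cherry with different leaf values the root is ambiguous and both costs are 1, as in Equation (30); joining two ambiguous children adds their costs, as in Equations (28) and (29).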
Now we are ready to prove Theorem 15.

Proof. (Theorem 15) Let $\Phi$ be a 3CNF with $k$ clauses. The number of
Boolean variables in $\Phi$ is at most $3k$, hence the tree $T_\Phi$ contains at most

$$\left(248 \times \left(\left\lceil \frac{k \log((3k-3)!) + 3k \log 2}{\log \gamma} \right\rceil + 1\right) + 8\right) \times k \tag{31}$$

leaves, and

$$6k + 296k \times \left(\left\lceil \frac{k \log((3k-3)!) + 3k \log 2}{\log \gamma} \right\rceil + 1\right) \tag{32}$$

extremities (twice the number of independent adjacencies appearing). To
explain Equation 31: 248 is the number of leaves on the unit subtree, which
is repeated $\lceil (k \log((n-3)!) + n \log 2)/\log \gamma \rceil + 1$ times; an upper bound for $n$ is $3k$, as
mentioned above; and there are 8 further leaves in the amending phase of
the construction of a subtree $T_{c_j}$ for a clause $c_j$. Finally, there are $k$ clauses.

To explain Equation 32: there is an adjacency for each Boolean variable,
there are at most $3k$ of them, each of them having 2 extremities, yielding $6k$
extremities at most. There are 148 extra adjacencies in each unit subtree,
having 296 extremities. Each unit subtree is repeated
$\lceil (k \log((n-3)!) + n \log 2)/\log \gamma \rceil + 1$ times, which is bounded from above by
$\lceil (k \log((3k-3)!) + 3k \log 2)/\log \gamma \rceil + 1$, and this is done for each of the
$k$ clauses.

Hence the input size for the SPSCJ problem is a polynomial function of
the size of $\Phi$.
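
To get a feeling for how these bounds grow, the sketch below evaluates Equations (31) and (32) for a given number of clauses $k$, using the upper bound $n = 3k$. It is an illustration only; the function name is hypothetical, and floating-point logarithms are used, with `math.lgamma(n - 2)` standing for $\log((n-3)!)$. Since $\log((3k-3)!)$ grows like $k \log k$, the number of leaves grows roughly like $k^3 \log k$.

```python
from math import ceil, lgamma, log

LOG_GAMMA = 20 * log(2) - 12 * log(3)   # log(2^20 / 3^12)


def instance_size(k):
    """Upper bounds (repetitions, leaves, extremities) for k clauses."""
    n = 3 * k                            # upper bound on the number of variables
    reps = ceil((k * lgamma(n - 2) + n * log(2)) / LOG_GAMMA) + 1
    leaves = (248 * reps + 8) * k        # Equation (31)
    extremities = 6 * k + 296 * k * reps  # Equation (32)
    return reps, leaves, extremities


for k in (1, 2, 5, 10):
    print(k, instance_size(k))
```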

If $\Phi$ is satisfiable, then there exists an assignment for which the number
of SCJ scenarios is at least

$$Y^k \times \left(((n-3)!)^k \times 2^n \times \gamma\right)^k \tag{33}$$

If at least one of the clauses is not satisfied, then the total number of SCJ
scenarios is at most

$$Y^k \times \left(((n-3)!)^k \times 2^n \times \gamma\right)^{k-1} \times ((n-3)!)^k \tag{34}$$

Therefore, if $\Phi$ is satisfiable, there are at most $2^n - 1$ assignments which do
not satisfy $\Phi$, and the number of corresponding SCJ scenarios is at most

$$Y^k \times \left(((n-3)!)^k \times 2^n \times \gamma\right)^{k-1} \times ((n-3)!)^k \times (2^n - 1). \tag{35}$$
Hence if $\Phi$ is satisfiable, then the number of SCJ scenarios related to
satisfying assignments is greater than the number of other SCJ scenarios. If
an FPAUS existed for the most parsimonious scenarios, then it would sample
satisfying scenarios with probability greater than 0.5. This would give an RP
algorithm for 3SAT, and an RP algorithm for 3SAT immediately implies that
RP = NP (Papadimitriou, 1993). □
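
The comparison of the two bounds reduces to a one-line computation: dividing the bound in Equation (33) by the bound in Equation (35), the common factors cancel and the ratio is $2^n \gamma / (2^n - 1)$, which exceeds 1 because $\gamma > 1$. The sketch below checks this with exact rational arithmetic for a few small values; both bounds are divided by the common factor $Y^k$, and the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial

GAMMA = Fraction(2**20, 3**12)


def satisfying_lower_bound(n, k):
    """Equation (33) divided by Y^k."""
    f = factorial(n - 3)
    return (f**k * 2**n * GAMMA) ** k


def non_satisfying_upper_bound(n, k):
    """Equation (35) divided by Y^k."""
    f = factorial(n - 3)
    return (f**k * 2**n * GAMMA) ** (k - 1) * f**k * (2**n - 1)


for n, k in [(3, 1), (5, 2), (8, 4)]:
    ratio = satisfying_lower_bound(n, k) / non_satisfying_upper_bound(n, k)
    print(n, k, float(ratio), ratio == GAMMA * 2**n / (2**n - 1))
```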



4.3. Counting problems



The same construction is sufficient to prove Theorem 16.

Proof. (Theorem 16) Assume that there is an FP algorithm for #SPSCJ.
Then for any 3CNF $\Phi$, construct the above-introduced problem instance
$x \in$ #SPSCJ, and calculate the exact number of solutions. If $\Phi$ is not
satisfiable, then the number of solutions is at most

$$Y^k \times \left(((n-3)!)^k \times 2^n \times \gamma\right)^{k-1} \times ((n-3)!)^k \times 2^n \tag{36}$$

If $\Phi$ can be satisfied, then the number of solutions is more than

$$Y^k \times \left(((n-3)!)^k \times 2^n \times \gamma\right)^k \tag{37}$$

Since the number in Equation 37 is greater than the number in Equation 36,
and the number of digits of these numbers grows only polynomially with $|\Phi|$,
given an FP algorithm for #SPSCJ, it would be decidable in polynomial
running time whether or not $\Phi$ is satisfiable. Since 3SAT $\in$ NP-complete,
it would imply that P = NP. □

Now we prove the counting counterpart of the same result, that is, #SPSCJ
is not in FPRAS unless RP = NP. For this we need to define a more
restricted problem.

Definition 21. The #Fitch-SPSCJ problem asks for the number of Fitch
solutions of an SPSCJ instance where pairs of adjacencies never share an
extremity and the values of a set of ambiguous adjacencies are fixed.

Although, stricto sensu, #Fitch-SPSCJ is still not a self-reducible
counting problem, we can prove that it has an FPAUS algorithm if it has
an FPRAS algorithm. Before proving it, we discuss in a nutshell how to
construct an FPAUS algorithm from an FPRAS algorithm for self-reducible
counting problems. The description is not detailed; for a strict mathematical
description, see Sinclair (1992).
The heart of the method that creates an FPAUS from a self-reducible
counting problem in FPRAS is a rejection sampler (von Neumann, 1951). A
random solution is drawn sequentially, travelling down the counting tree of
the self-reducible problem, using the FPRAS approximations for the children
of the current node, and at each internal node the sampling probability is
calculated. The sampling probabilities are used to calculate the so-called
rejection rate, the probability that the sample will be rejected. The central
theorem of the rejection method states that the accepted samples come from
exactly the uniform distribution. To transform this into an FPAUS, the
rejection rate should be relatively small, so that in a few (polynomial number
of) trials, the probability that all trials are rejected becomes negligible. If all
trials are rejected, then an arbitrary solution is drawn, but due to its
extremely small probability, it causes a very small deviation from the uniform
distribution (measured in variational distance).
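
The rejection scheme just described can be tried out on a toy self-reducible problem. The sketch below is only an illustration under loudly stated assumptions: the toy problem (binary strings of length $n$ without two adjacent 1s) is not from the paper, the "FPRAS" is simulated by an exact counter multiplied by random noise in $[1-\varepsilon, 1+\varepsilon]$, and all names are hypothetical. The acceptance probability is a lower bound on the sampling probability divided by the sampling probability itself, so every solution is returned with the same probability in each trial.

```python
import random


def exact_count(prefix, n):
    """Length-n binary strings without adjacent 1s that extend prefix."""
    if any(a == 1 and b == 1 for a, b in zip(prefix, prefix[1:])):
        return 0
    if len(prefix) == n:
        return 1
    return exact_count(prefix + (0,), n) + exact_count(prefix + (1,), n)


def approx_count(prefix, n, eps, rng):
    """Stand-in for an FPRAS: relative error at most eps."""
    return exact_count(prefix, n) * rng.uniform(1 - eps, 1 + eps)


def sample_once(n, eps, rng):
    """One trial: walk down the counting tree, then accept or reject."""
    root_estimate = approx_count((), n, eps, rng)
    prefix, prob = (), 1.0
    for _ in range(n):
        w0 = approx_count(prefix + (0,), n, eps, rng)
        w1 = approx_count(prefix + (1,), n, eps, rng)
        bit = 0 if rng.random() < w0 / (w0 + w1) else 1
        prob *= (w0 if bit == 0 else w1) / (w0 + w1)   # sampling probability
        prefix += (bit,)
    # Lower bound on prob that holds for every leaf of the counting tree.
    floor = ((1 - eps) / (1 + eps)) ** n * (1 - eps) / root_estimate
    accept = floor / prob
    assert accept <= 1.0 + 1e-9
    return prefix if rng.random() < accept else None


def sample(n, eps, rng, trials=50):
    for _ in range(trials):
        solution = sample_once(n, eps, rng)
        if solution is not None:
            return solution
    return (0,) * n   # arbitrary solution if every trial was rejected


rng = random.Random(0)
print([sample(4, 0.05, rng) for _ in range(5)])
```

With a small relative error the acceptance probability stays bounded away from zero, which is why a polynomial number of trials suffices in the argument above.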

Lemma 22. #Fitch-SPSCJ $\in$ FPRAS $\Rightarrow$ #Fitch-SPSCJ $\in$ FPAUS

Proof. It is sufficient to show that the solutions can be put onto a counting
tree such that the depth of the tree is $O(\mathrm{poly}(|x|))$, where $|x|$ is the size of
the problem instance, and for any internal node, one of the following is true:

• The number of descendants of the internal node is $O(\mathrm{poly}(|x|))$, where
$|x|$ is the size of the problem instance, and for each descendant, a
problem $x' \in$ #Fitch-SPSCJ exists whose number of solutions is the
number of leaves below that descendant, and $|x'| = O(\mathrm{poly}(|x|))$.

• The number of descendants is $O(c^{\mathrm{poly}(|x|)})$ for some $c > 1$, but a perfect
sampler exists that can sample exactly from the uniform distribution of the
descendants and the number of descendants can be calculated; both
the sampler and the counter run in $O(\mathrm{poly}(|x|))$ time. Furthermore, all
descendants are leaves.

The algorithms in the second case ensure that the protocol constructing an
FPAUS sampler from an FPRAS algorithm, described briefly above, can
also be carried out for those nodes which have a superpolynomial number of
descendants, provided that counting them and sampling them exactly
uniformly can be done in polynomial time. Indeed, both sampling and
calculating the sampling probabilities can be done in polynomial running
time, and it is easy to see that the exact uniform sampling does not increase
the rejection rate.
Fix an arbitrary total ordering of adjacencies. Let $u$ be an internal node,
and let $z$ be the problem associated with it. If there are ambiguities at the root
of the evolutionary tree, then take the smallest adjacency with ambiguity
and without a constraint, and denote it by $\alpha$. Then $u$ will have two
descendants, and they are associated with the problem instances in which
$z$ is modified such that $\alpha$ has constraint 0 and constraint 1, respectively.

If $z$ does not have any ambiguity, then its assignment is unique. For
this unique assignment, the number of SCJ scenarios along each edge can be
counted and exactly uniformly sampled (Theorem 13), so these will be the
descendants of $u$ and also the leaves below $u$. □



The next lemma leads directly to the proof of Theorem 17.

Lemma 23. #SPSCJ $\in$ FPRAS $\Rightarrow$ #Fitch-SPSCJ $\in$ FPRAS

Proof. Let $x$ be a problem instance from #Fitch-SPSCJ. Let $A$ denote the
set of adjacencies which are ambiguous at the root, but there are constraints
on them. Let $T$ denote the evolutionary tree of the problem instance $x$.

We construct another problem $x'$, which has the same number of SCJ
solutions but there are no ambiguities for those adjacencies which are in $A$.
We remove each $\alpha \in A$, and introduce new, independent adjacencies. For
any $\alpha \in A$, let $E(\alpha)$ denote the set of edges of $T$ for which an SCJ operation
is necessary with the prescribed assignment of $\alpha$. We introduce $|E(\alpha)|$ new,
independent adjacencies in the following way. For each $e \in E(\alpha)$, if $\alpha$ is
generated on the edge, then let the corresponding adjacency $\alpha_e$ be present
at the leaves below edge $e$, and nowhere else. Otherwise, if $\alpha$ is cut along
the edge $e$, let the corresponding adjacency $\alpha_e$ be absent at the leaves below
edge $e$, and be present at all other leaves. It is easy to see that the only most
parsimonious solution for $\alpha_e$ is to create or cut $\alpha_e$ with an SCJ operation
along edge $e$. Clearly, $x' \in$ #SPSCJ, as there are no constraints on its
adjacencies, the number of solutions for $x'$ is the same as the number of
solutions for $x$, moreover

$$|x'| = O(|x| + |A| \times |T|). \tag{38}$$

Therefore an FPRAS algorithm for $x'$ is also an FPRAS for $x$. □



We can now prove Theorem 17.

Proof. (Theorem 17) From Lemma 23,

$$\#\mathrm{SPSCJ} \in \mathrm{FPRAS} \Rightarrow \#\mathrm{Fitch\text{-}SPSCJ} \in \mathrm{FPRAS} \tag{39}$$

From Lemma 22,

$$\#\mathrm{Fitch\text{-}SPSCJ} \in \mathrm{FPRAS} \Rightarrow \#\mathrm{Fitch\text{-}SPSCJ} \in \mathrm{FPAUS} \tag{40}$$

Putting these together, we get that

$$\#\mathrm{SPSCJ} \in \mathrm{FPRAS} \Rightarrow \#\mathrm{Fitch\text{-}SPSCJ} \in \mathrm{FPAUS} \tag{41}$$

But from the proof of Theorem 15 it is clear that an FPAUS already for
#Fitch-SPSCJ would imply that RP = NP. □

5. Discussion/Conclusions 

We proved non-approximability for a counting problem motivated by com- 
putational biology, whose optimization/decision counterpart problem is in P. 

The problem is related to the evolution of discrete characters: imagine 
a set of n independent characters from a finite set (a nucleotide sequence 
where all nucleotides evolve independently for instance), and a set of species 
related by a binary phylogenetic tree. The values of the n characters are 
known at the leaves, and the small parsimony problem asks for assignments 
at the internal nodes of the tree. Here finding one most parsimonious as- 
signment is easy, but it is also easy to count their number or sample them 
uniformly, when they are all independent, which is not the case for adjacen- 
cies in genomes. However, if the assignments are weighted by the number 
of most parsimonious evolutionary scenarios on the whole set of characters, 
then there is no possible efficient counting or sampling method. Indeed, in 
our proof all adjacencies are independent, so it applies to this more general 
problem. 

This study also highlights a counting bias in the parsimony SCJ model 
with independent adjacencies (or evolutionary scenarios on discrete charac- 
ters). For example, take a cherry with ambiguous values at its root. The 
number of scenarios is higher if the assignment at the root of the cherry is 
equal to one of the leaves than if it is a mix between the two. In an unbiased 
model all assignments should be equiprobable. This observation leads to two 
possible directions for future work: 

Counting assignments. If all assignments should be equiprobable,
then the problem is to count and sample in the assignment solution 
space. It is our unpublished result that counting the number of Fitch 
solutions to SPSCJ is in FP, but counting the Sankoff type assignments 
has an unknown computational complexity. 

Probabilistic models. The bias of the parsimony model will drop
in a probabilistic approach. Here mutations follow a continuous-time
Markov model. In that case, each potential SCJ operation has an
exponential waiting time for its occurrence. The so-called trajectory
likelihood can be calculated analytically, see Miklos et al. (2004). The
sum of the trajectory likelihoods is the total likelihood of two genomes,
i.e., the probability that genome $G_1$ becomes genome $G_2$ after
time $t$, given a set of parameters for the exponential distributions put
onto the potential SCJ operations. The total likelihood calculation has
an unknown computational complexity.

We can also consider the probabilistic approach on a tree. In case of 
independent events, it can be shown that the multinomial coefficients 
describing how many combinations exist to merge the independent SCJ 
operations are cancelled out in the likelihood calculations. If all edge 
lengths of the evolutionary tree are the same, and all adjacencies are in- 
dependent, then the probabilistic #SPSCJ problem reduces to counting 
the assignments to the internal nodes of the evolutionary tree, which 
might have a simpler computational complexity. 

These are promising future directions of research, which can be important 
for comparative genomics. To close the mathematical aspects of the SPSCJ 
problem, two unsolved questions remain: 

• #P-completeness of #SPSCJ. Our conjecture is that #SPSCJ $\in$
#P-complete. Theorem 16 supports this conjecture. Although
#3SAT $\in$ #P-complete, the construction in the proof of Theorem 15
is not sufficient for counting the number of satisfying assignments of
$\Phi$. For each satisfying assignment, there is a multiplicative coefficient
that can vary between $(((n-3)/2)!)^2$ and $(n-3)!$, and this obscures the exact
number of solutions.

• Star tree problem. Given a set of genomes $G_1, G_2, \ldots, G_k$ related
by a star tree, count and sample their most parsimonious SCJ scenarios.
If $k$ is odd, then the assignment for the centre of the star tree is
unique. This was proved for the median genome of 3 genomes by Feijao and
Meidanis (2011), and their proof can be extended to any odd number
of genomes. However, when $k$ is even, the median might not be
unique, and there might be exponentially many solutions for the
assignment. The computational complexity for this case is an open question.
This generalizes to the small parsimony problem on non-binary trees.